"Advanced Graphics and Data Visualization in R" is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course will be centred on SARS-CoV-2 epidemiological and genomic data, the lessons learned herein will be broadly applicable.
This lesson is the fifth in a 6-part series. The aim of this series is for students to learn how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.
The structure of the class is a code-along style in Jupyter notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto Jupyter Hub so students can program along with the instructor.
This week will focus on exploring the relationships within your data: comparing observations between experimental samples and applying a suite of clustering, dimension-reduction, and projection algorithms.
At the end of this lecture you will have covered visualizations related to
grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.
Today's datasets will focus on the smaller datasets we worked on in earlier lectures (namely our Ontario public health unit COVID-19 demographics data), and a new set of RNAseq analyses of different tissue samples from COVID-19 patients.
This is a combination of datasets from previous lectures. This incorporates PHU demographic data with StatsCan census data from 2017 to produce a normalized estimate of cases, deaths, and hospitalizations across age groups and public health units in Ontario.
This is the same readcount data we looked at in Lecture 04. RNA-Seq read count data generated from SARS-CoV2 infections of AEC cells. Used to compare the timecourse of expression (pro-inflammatory) changes in samples treated with and without HSP90 inhibitors. Published in iScience doi: https://doi.org/10.1016/j.isci.2021.102151
From Desai et al., 2020 on medRxiv doi: https://doi.org/10.1101/2020.07.30.20165241, this dataset has normalized expression counts from RNAseq data. It covers multiple samples and tissues from COVID-positive patients with a focus on lung tissue. The expression data has been left unfiltered, so it also contains SARS-CoV-2 expression data.
From Desai et al., 2020 on medRxiv doi: https://doi.org/10.1101/2020.07.30.20165241 this dataset contains patient information like viral load from tissues that were used for RNAseq.
repr - a package useful for altering some of the attributes of objects related to the R kernel.
tidyverse which has a number of packages including dplyr, tidyr, stringr, forcats and ggplot2
viridis helps to create color-blind palettes for our data visualizations
RColorBrewer has some helpful palettes that we'll need to colour our data.
gplots will be used to help generate heatmap/dendrogram visualizations.
FactoMineR and factoextra will be used for PCA generation and visualization
Rtsne and umap are packages implementing non-linear projection algorithms
install.packages("gplots")
install.packages("FactoMineR")
install.packages("factoextra")
install.packages("Rtsne")
install.packages("umap")
# Packages to help tidy our data
library(tidyverse)
library(readxl)
# Packages for the graphical analysis section
library(repr)
library(viridis)
library(gplots)
library(RColorBrewer)
# Useful for PCA and PCA visualization
library(FactoMineR)
library(factoextra)
# Data projection packages
library(Rtsne)
library(umap)
Warning message: "package 'tidyverse' was built under R version 4.0.5" -- Attaching packages --------------------------------------- tidyverse 1.3.1 -- v ggplot2 3.3.5 v purrr 0.3.4 v tibble 3.1.6 v dplyr 1.0.8 v tidyr 1.2.0 v stringr 1.4.0 v readr 2.1.2 v forcats 0.5.1 Warning message: "package 'ggplot2' was built under R version 4.0.5" Warning message: "package 'tibble' was built under R version 4.0.5" Warning message: "package 'tidyr' was built under R version 4.0.5" Warning message: "package 'readr' was built under R version 4.0.5" Warning message: "package 'dplyr' was built under R version 4.0.5" Warning message: "package 'forcats' was built under R version 4.0.5" -- Conflicts ------------------------------------------ tidyverse_conflicts() -- x dplyr::filter() masks stats::filter() x dplyr::lag() masks stats::lag() Warning message: "package 'repr' was built under R version 4.0.5" Warning message: "package 'viridis' was built under R version 4.0.5" Loading required package: viridisLite Warning message: "package 'viridisLite' was built under R version 4.0.5" Warning message: "package 'gplots' was built under R version 4.0.5" Attaching package: 'gplots' The following object is masked from 'package:stats': lowess Warning message: "package 'FactoMineR' was built under R version 4.0.5" Warning message: "package 'factoextra' was built under R version 4.0.5" Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa Warning message: "package 'Rtsne' was built under R version 4.0.5" Warning message: "package 'umap' was built under R version 4.0.5"
Last week we looked at an analysis of RNAseq data through a number of methods starting with broad-level volcano plots and moving towards gene-level expression visualizations with dotplots. In between we stopped to take a look at heatmaps. In this instance we simply used heatmaps to convey expression levels of multiple genes across one or more samples.
Looking more broadly, we now wish to ask questions such as "how similar are our samples?", "can our samples be grouped or categorized in some way?" and "is there an underlying structure or architecture to our data?"
We can study these aspects using a number of techniques that range from clustering/categorization to projection of high-dimensional data onto a lower set of dimensions (usually 2 or 3).
We cannot begin our journey until we talk about the nature of the data we are interested in examining. Usually, our data will consist of many samples (observations) with some number of features captured about each observation. For example, with RNAseq data we could consider the measurements we capture for each gene as a separate feature/variable/column.
Conversely you may have hundreds or thousands of samples you'd like to categorize in some way (or show that you can categorize) with just a smaller set of features. For every nut, there is a tool of sorts for cracking it!
We'll start with a modified version of the PHU age group dataset that we've seen before from Lecture 02 and 03. The demographics data has been modified to include a set of normalized values (individuals per 100k) that was calculated based on StatsCan 2017 census data. Modeling off of these data, the 2022 sizes for each age group have been estimated and case, death, and hospitalization counts have been normalized based on these subgroup sizes.
The updated dataset will be found in phu_demographics_census_norm.csv and we'll use it to guide us through the two sections of today's lecture.
Let's begin by loading the data and taking a look at its structure.
# Import the normalized demographics data
covid_demographics_norm.df <- read_csv("./data/phu_demographics_census_norm.csv")
# Look at its structure
str(covid_demographics_norm.df, give.attr = FALSE)
Rows: 238 Columns: 15 -- Column specification -------------------------------------------------------- Delimiter: "," chr (4): from_date, to_date, public_health_unit, age_group dbl (11): total_cases, total_deaths, total_hospitalizations_count, percent_c... i Use `spec()` to retrieve the full column specification for this data. i Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec_tbl_df [238 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ from_date : chr [1:238] "January 15, 2020" "January 15, 2020" "January 15, 2020" "January 15, 2020" ... $ to_date : chr [1:238] "March 2, 2022" "March 2, 2022" "March 2, 2022" "March 2, 2022" ... $ public_health_unit : chr [1:238] "Algoma" "Algoma" "Algoma" "Algoma" ... $ age_group : chr [1:238] "0 to 4" "12 to 19" "20 to 39" "40 to 59" ... $ total_cases : num [1:238] 181 412 1868 1424 379 ... $ total_deaths : num [1:238] 0 0 1 5 0 14 13 0 0 2 ... $ total_hospitalizations_count: num [1:238] 9 5 31 48 0 69 44 10 2 35 ... $ percent_cases : num [1:238] 0.0343 0.0781 0.3543 0.2701 0.0719 ... $ percent_deaths : num [1:238] 0 0 0.0303 0.1515 0 ... $ percent_hospitalizations : num [1:238] 0.0437 0.0243 0.1505 0.233 0 ... $ population_2022 : num [1:238] 117840 117840 117840 117840 117840 ... $ individuals_2022 : num [1:238] 5481 9230 24676 33111 7751 ... $ cases_per_100k : num [1:238] 3302 4464 7570 4301 4890 ... $ deaths_per_100k : num [1:238] 0 0 4 15 0 46 177 0 0 5 ... $ hospitalizations_per_100k : num [1:238] 164 54 126 145 0 228 598 116 13 95 ...
Let's begin looking at our covid_demographics_norm.df. Using a series of data sets, we've created a consolidated dataset with:
The question we want to answer is: of the 34 public health units, which look most similar based on the normalized data for each age group? In order to visualize this data, we'll want to convert our current long-form data that looks something like this:
Our starting table:
| public_health_unit | age_group | total_cases | ... | cases_per_100k | deaths_per_100k | hospitalizations_per_100k |
|---|---|---|---|---|---|---|
| Algoma | 0 to 4 | 181 | ... | 3302 | 0 | 164 |
| Algoma | 12 to 19 | 412 | ... | 4464 | 0 | 54 |
| ... | ... | ... | ... | ... | ... | ... |
Our end result:
| public_health_unit | population_2022 | 0 to 4 | 5 to 11 | 12 to 19 | 20 to 39 | 40 to 59 | 60 to 79 | 80+ |
|---|---|---|---|---|---|---|---|---|
| Algoma | 117840 | 3302 | 4464 | 7570 | 4301 | 4890 | 2331 | 4116 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
We need to do the following to the dataset:
# Create a wide-format version of our normalized data
covid_demographics_norm_wide.df <-
# Start by passing along the long-form normalized data
covid_demographics_norm.df %>%
# Select for just the PHU, age_group, population size and cases_per_100k
select(c(3,4,11,13)) %>%
# Pivot the age_group/cases_per_100k data out wider
pivot_wider(names_from = age_group,
values_from = cases_per_100k
) %>%
# Move the "5 to 11" category to after "0 to 4". You could use a longer "select" call to do this too!
relocate(., `5 to 11`, .after=`0 to 4`)
# Take a look at our resulting dataframe
head(covid_demographics_norm_wide.df)
| public_health_unit | population_2022 | 0 to 4 | 5 to 11 | 12 to 19 | 20 to 39 | 40 to 59 | 60 to 79 | 80+ |
|---|---|---|---|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| Algoma | 117840 | 3302 | 4890 | 4464 | 7570 | 4301 | 2331 | 4116 |
| Brant County | 153558 | 4481 | 5376 | 6499 | 9448 | 6429 | 3995 | 5684 |
| Durham Region | 711426 | 4669 | 5697 | 6407 | 11018 | 7333 | 4985 | 7562 |
| Grey Bruce | 176150 | 2324 | 3040 | 4643 | 5995 | 3489 | 2022 | 3056 |
| Haldimand-Norfolk | 120008 | 3002 | 4956 | 5388 | 9803 | 5799 | 3568 | 6261 |
| Haliburton, Kawartha, Pine Ridge District | 190728 | 2263 | 2865 | 3673 | 7368 | 3681 | 1923 | 3612 |
At this point, we would normally be prepared to visualize our data with ggplot but our data is in wide-format and we'll be using a package that prefers to work with matrix data. In that case, we need to strip the data down further because matrix data must be all of the same type and we want to work with numeric data! We'll use as.matrix() for our conversion.
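As a quick aside on why the type restriction matters: if any column of a data frame is character, as.matrix() coerces every cell to character. A minimal sketch with a made-up two-column data frame:

```r
# Toy data frame mixing a character column with a numeric one
mixed.df <- data.frame(phu = c("A", "B"), cases = c(10, 20))

# Coercing the whole data frame turns EVERY cell into a character string
class(as.matrix(mixed.df)[1, "cases"])
# "character"

# Dropping the character column first keeps the matrix numeric
class(as.matrix(mixed.df[, -1, drop = FALSE])[1, 1])
# "numeric"
```

This is exactly why we strip the public_health_unit column out before casting, and keep its information in the row names instead.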
We'll also do the following:

- Move the public_health_unit information over to the row names of our matrix so we can still track our data properly.
- Keep the population_2022 column of data, but we'll have to remember that it's there.

# Now we need to make our matrix and assign row names.
# It's kind of awkward to need this requirement.
# Cast our matrix and drop the first column
covid_demographics_norm.mx <- as.matrix(covid_demographics_norm_wide.df[, -1])
# Set the row names using the information from the data frame
rownames(covid_demographics_norm.mx) <- covid_demographics_norm_wide.df$public_health_unit
head(covid_demographics_norm.mx)
| population_2022 | 0 to 4 | 5 to 11 | 12 to 19 | 20 to 39 | 40 to 59 | 60 to 79 | 80+ | |
|---|---|---|---|---|---|---|---|---|
| Algoma | 117840 | 3302 | 4890 | 4464 | 7570 | 4301 | 2331 | 4116 |
| Brant County | 153558 | 4481 | 5376 | 6499 | 9448 | 6429 | 3995 | 5684 |
| Durham Region | 711426 | 4669 | 5697 | 6407 | 11018 | 7333 | 4985 | 7562 |
| Grey Bruce | 176150 | 2324 | 3040 | 4643 | 5995 | 3489 | 2022 | 3056 |
| Haldimand-Norfolk | 120008 | 3002 | 4956 | 5388 | 9803 | 5799 | 3568 | 6261 |
| Haliburton, Kawartha, Pine Ridge District | 190728 | 2263 | 2865 | 3673 | 7368 | 3681 | 1923 | 3612 |
### heatmap.2()

From the gplots package we can use the function heatmap.2() to generate a heatmap that can do a little more than what we were doing with our ggplot2 version. The nice aspect of using heatmap.2() is that we can ask it to reorder our samples along both the rows and columns. At the same time, we can present the relationships between column elements or row elements in the form of dendrograms.
In its current form, we have ordered our data along the columns in an ascending ordinal arrangement. While this makes sense to us now, it may not help to visually identify the strongest trends or groupings in our data. Clustering attempts to bring order to our data by grouping data according to specific algorithms.
Overall, single points are usually grouped together as neighbours, with the "nearest" neighbours determined by a metric of some kind. These clusters are then further grouped using the same metrics until the entire set is presented in these ordered groups. These groupings/relationships can also be presented as dendrograms. In our current case, we aren't concerned with determining groups per se, but rather with connecting items by their similarity and then creating a hierarchical order.
Our data is usually coded in n-dimensional space, depending on the nature of our dataset. For covid_demographics_norm.mx our 7 columns are coded by data from 34 PHUs meaning each column is coded in 34 dimensions or 34 features. Conversely, our 34 rows represent PHUs each coded by data across 7 age groups and therefore 7 dimensions.
Important or helpful parameters to run heatmap.2()
- x: our matrix object. It must be a matrix, and not a data frame.
- trace: draws a line down the rows, columns or both, indicating how far (proportionately) the value strays from the mean.
- cexRow and cexCol: set our row and column text size respectively.
- key: a logical indicating whether or not the color-key will be shown.
- col: determines the colours used for the image.
- dendrogram: include a dendrogram of the data. Values can be "none", "row", "column" or "both".

Let's plot our data with and without a dendrogram to compare.
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=20)
# Plot our data without ordering. So it will be ordered as-is.
heatmap.2(x=(covid_demographics_norm.mx[,-1]), # get rid of our population information
main = "Heatmap of per 100k infection rates for PHU age groups - no dendrogram",
col = viridis(100), # Use the viridis colour scheme
cexRow = 2, # Set row text size
cexCol = 2, # Set column text size
lhei = c(2,10), # Alter our key height slightly
lwid = c(2, 10),
margins=c(8,32), # Increase text margins because our PHU names can be big
trace = "none", # Get rid of the trace line down columns
Rowv = FALSE, Colv = "Rowv", # Here we've turned off the dendrograms
dendrogram = "none" # Optional to turn off warnings
)
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=20)
heatmap.2(x=(covid_demographics_norm.mx[,-1]), # get rid of our population information
main = "Heatmap of per 100k infection rates for PHU age groups - with a dendrogram",
col = viridis(100), # Use the viridis colour scheme
cexRow = 2, # Set row text size
cexCol = 2, # Set column text size
lhei = c(2,10), # Alter our key height slightly
margins=c(8,28), # Increase text margins because our PHU names can be big
trace = "none", # Get rid of the trace line down columns
dendrogram = "both"
)
Our heatmap above is drawn from a relatively simple dataset but it helps us to see that there are some strong signals that differentiate between our different PHUs. Now this is a 34x7 grid where we can investigate all of the data in our dataset. What happens when we want to produce this kind of data from something more complex?
Recall from last lecture we investigated read count data from Wyler et al., 2020 - Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv. We used this dataset to produce a scatterplot matrix to compare between some of the replicate data within the set. Across this set there are 36 sets of RNA-Seq data spanning 27,011 genes.
Let's begin by opening up the data file and getting a quick reminder of what it looks like.
# Read in your read_count data
wyler_readcounts.df <- read_tsv("./data/Wyler2020_AEC_SARSCoV2_17AAG_readcounts.tsv")
str(wyler_readcounts.df, give.attr = FALSE)
Rows: 27011 Columns: 38 -- Column specification -------------------------------------------------------- Delimiter: "\t" chr (1): gene dbl (37): width, AECII_01_24h_DMSO-1, AECII_02_24h_DMSO-2, AECII_03_24h_DMSO... i Use `spec()` to retrieve the full column specification for this data. i Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec_tbl_df [27,011 x 38] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ gene : chr [1:27011] "A1BG" "A1BG-AS1" "A1CF" "A2M" ... $ width : num [1:27011] 1766 2130 9529 4653 2187 ... $ AECII_01_24h_DMSO-1 : num [1:27011] 22 55 0 0 0 ... $ AECII_02_24h_DMSO-2 : num [1:27011] 19 20 0 1 0 ... $ AECII_03_24h_DMSO-3 : num [1:27011] 15 38 0 0 2 ... $ AECII_04_24h_200nM17AAG-1 : num [1:27011] 29 11 0 0 2 4740 0 0 882 0 ... $ AECII_05_24h_200nM17AAG-2 : num [1:27011] 27 15 0 0 0 ... $ AECII_06_24h_200nM17AAG-3 : num [1:27011] 30 17 0 0 2 4220 0 0 820 0 ... $ AECII_07_24h_S2_DMSO-1 : num [1:27011] 24 41 0 0 0 ... $ AECII_08_24h_S2_DMSO-2 : num [1:27011] 18 46 0 0 2 ... $ AECII_09_24h_S2_DMSO-3 : num [1:27011] 22 42 0 0 4 2750 0 0 739 0 ... $ AECII_10_24h_S2_200nM17AAG-1: num [1:27011] 23 13 0 1 3 ... $ AECII_11_24h_S2_200nM17AAG-2: num [1:27011] 25 24 0 0 1 ... $ AECII_12_24h_S2_200nM17AAG-3: num [1:27011] 18 28 1 0 1 ... $ AECII_13_48h_DMSO-1 : num [1:27011] 24 57 0 0 0 ... $ AECII_14_48h_DMSO-2 : num [1:27011] 38 56 0 0 0 1930 0 0 892 0 ... $ AECII_15_48h_DMSO-3 : num [1:27011] 33 64 0 0 0 ... $ AECII_16_48h_200nM17AAG-1 : num [1:27011] 29 35 0 0 1 ... $ AECII_17_48h_200nM17AAG-2 : num [1:27011] 20 24 0 0 0 ... $ AECII_18_48h_200nM17AAG-3 : num [1:27011] 23 28 0 0 0 ... $ AECII_19_48h_S2_DMSO-1 : num [1:27011] 13 45 0 0 0 ... $ AECII_20_48h_S2_DMSO-2 : num [1:27011] 22 36 0 0 0 ... $ AECII_21_48h_S2_DMSO-3 : num [1:27011] 21 30 0 0 1 ... $ AECII_22_48h_S2_200nM17AAG-1: num [1:27011] 31 29 0 0 1 ... $ AECII_23_48h_S2_200nM17AAG-2: num [1:27011] 36 23 0 0 1 ... $ AECII_24_48h_S2_200nM17AAG-3: num [1:27011] 20 19 1 0 0 ... $ AECII_25_72h_DMSO-1 : num [1:27011] 35 43 0 1 2 ... $ AECII_26_72h_DMSO-2 : num [1:27011] 26 38 0 0 1 997 0 0 753 0 ... $ AECII_27_72h_DMSO-3 : num [1:27011] 29 53 0 0 4 1230 0 0 953 0 ... $ AECII_28_72h_200nM17AAG-1 : num [1:27011] 32 57 0 0 2 ... $ AECII_29_72h_200nM17AAG-2 : num [1:27011] 41 49 0 0 2 1050 0 0 553 0 ... 
$ AECII_30_72h_200nM17AAG-3 : num [1:27011] 33 37 0 0 3 857 0 0 458 0 ... $ AECII_31_72h_S2_DMSO-1 : num [1:27011] 24 58 0 0 1 ... $ AECII_32_72h_S2_DMSO-2 : num [1:27011] 33 63 0 0 3 ... $ AECII_33_72h_S2_DMSO-3 : num [1:27011] 23 47 1 1 2 ... $ AECII_34_72h_S2_200nM17AAG-1: num [1:27011] 22 39 0 1 1 ... $ AECII_35_72h_S2_200nM17AAG-2: num [1:27011] 28 26 0 0 2 866 0 0 481 0 ... $ AECII_36_72h_S2_200nM17AAG-3: num [1:27011] 43 34 0 0 3 ...
As you might recall from last week, there is quite a bit of data in this set. So that we can more easily visualize our data, we'll want to wrangle it a bit more by:
wyler_readcounts_filtered.df <-
wyler_readcounts.df %>%
# Rename the columns by removing the first portion: AECII_xx
rename_with(., ~ str_replace(string = .x,
pattern = r"(\w*_\d*_)",
replacement = "")) %>%
# filter out the low-readcount data
filter(if_all(.cols = -1, .fns = ~ .x > 1000 & .x < 3000))
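To see what that rename_with() call is doing, here is the same str_replace() run on a single column name from the set (the r"(...)" raw-string syntax requires R >= 4.0):

```r
library(stringr)

# One of the original Wyler column names
sample_name <- "AECII_01_24h_DMSO-1"

# \w*_\d*_ matches the leading "AECII_01_" prefix; replacing it with ""
# leaves just the timepoint/treatment/replicate portion of the name
str_replace(sample_name,
            pattern = r"(\w*_\d*_)",
            replacement = "")
# "24h_DMSO-1"
```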
# Create our data matrix
wyler_readcounts.mx <- as.matrix(wyler_readcounts_filtered.df[, 3:38])
# Set the row names using the information from the data frame
rownames(wyler_readcounts.mx) <- wyler_readcounts_filtered.df$gene
# Check the characteristics of our matrix
head(wyler_readcounts.mx)
str(wyler_readcounts.mx)
| 24h_DMSO-1 | 24h_DMSO-2 | 24h_DMSO-3 | 24h_200nM17AAG-1 | 24h_200nM17AAG-2 | 24h_200nM17AAG-3 | 24h_S2_DMSO-1 | 24h_S2_DMSO-2 | 24h_S2_DMSO-3 | 24h_S2_200nM17AAG-1 | ... | 72h_DMSO-3 | 72h_200nM17AAG-1 | 72h_200nM17AAG-2 | 72h_200nM17AAG-3 | 72h_S2_DMSO-1 | 72h_S2_DMSO-2 | 72h_S2_DMSO-3 | 72h_S2_200nM17AAG-1 | 72h_S2_200nM17AAG-2 | 72h_S2_200nM17AAG-3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AP1S1 | 1957 | 1310 | 1623 | 2520 | 2448 | 2618 | 1644 | 1585 | 1741 | 2536 | ... | 1953 | 1835 | 1738 | 1490 | 1571 | 1657 | 1394 | 1531 | 1331 | 1633 |
| AP2S1 | 1575 | 1162 | 1408 | 1669 | 1441 | 1643 | 1187 | 1200 | 1379 | 1551 | ... | 1561 | 1719 | 1489 | 1318 | 1434 | 1389 | 1219 | 1522 | 1266 | 1536 |
| APH1A | 2017 | 1533 | 1890 | 2281 | 2089 | 2266 | 1794 | 1697 | 1943 | 2043 | ... | 2253 | 2532 | 2373 | 2010 | 2329 | 2316 | 2109 | 2306 | 1804 | 2375 |
| ARL8B | 1835 | 1506 | 1697 | 2823 | 2708 | 2937 | 1616 | 1566 | 1812 | 2994 | ... | 2043 | 2927 | 2713 | 2275 | 2181 | 2194 | 1903 | 2658 | 2072 | 2629 |
| ARPC4 | 2503 | 1651 | 2005 | 1732 | 1603 | 1663 | 2051 | 1899 | 2285 | 1602 | ... | 2225 | 1931 | 1812 | 1658 | 2027 | 2095 | 1664 | 1990 | 1549 | 1899 |
| ATP6AP1 | 1783 | 1373 | 1714 | 2618 | 2426 | 2638 | 1503 | 1467 | 1673 | 2291 | ... | 2524 | 2231 | 2006 | 1709 | 2619 | 2713 | 2368 | 1685 | 1560 | 1855 |
num [1:109, 1:36] 1957 1575 2017 1835 2503 ... - attr(*, "dimnames")=List of 2 ..$ : chr [1:109] "AP1S1" "AP2S1" "APH1A" "ARL8B" ... ..$ : chr [1:36] "24h_DMSO-1" "24h_DMSO-2" "24h_DMSO-3" "24h_200nM17AAG-1" ...
What if we were to produce a heatmap directly with the raw data? It would be impossible to read, and computing a dendrogram across all 27,011 rows and 36 datasets would take a very long time. That's why we'll use our small subset of data as an example to build a heatmap.
Let's plot the data now with heatmap.2 and see how it looks with a dendrogram.
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=30)
# Plot a subset of our RNA-Seq data with some ordering.
heatmap.2(x=wyler_readcounts.mx, # our filtered readcount matrix
main = "Heatmap of Wyler et al., 2020 readcount data",
col = viridis(100), # Use the viridis colour scheme
cexRow = 1, # Set row text size
cexCol = 2, # Set column text size
lhei = c(2,10), # Alter our key height slightly
lwid = c(2, 10),
margins=c(20,20), # Increase text margins because our sample and gene names can be long
trace = "none", # Get rid of the trace line down columns
dendrogram = "both" # Draw dendrograms for both rows and columns
)
From the above output, we can see that our heatmap is already quite hard to read and that's coming off of using just 109 genes! We do, however, get some strong trends regarding our datasets - we can see certain sets of replicates are grouped together in the data but not all of them. This might be clearer if we were to factor in all of the datapoints or just the relevant genes that define the differences between datasets. We'll discuss the second part of that thought later in this lecture.
Recall that what we're really interested in, regarding these readcount data, is to ask just how similar the experiments are between each other. Whereas we were a little limited by the space needed to produce the scatterplot matrix, we were able to produce correlation values between experiments on a small scale. We can extend that visualization forward to compare between all datasets and then visually summarize the data as a heatmap.
The portion of the scatterplot matrix we'll extend is the correlation values. Whereas the ggpairs() function uses a Pearson correlation, we can choose between Pearson or Spearman correlation. Sometimes you may wish to compare both since Pearson examines linear relationships whereas Spearman uses a ranking method to compare variables for monotonic relationships.
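A quick contrast between the two methods on made-up values: a perfectly monotonic but non-linear relationship gets a Spearman coefficient of 1, while Pearson reports something less.

```r
x <- 1:10
y <- x^3   # monotonic, but clearly not linear

cor(x, y, method = "pearson")    # less than 1: the relationship isn't linear
cor(x, y, method = "spearman")   # 1: the ranks agree perfectly
```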
To generate our matrix we'll employ a couple of additional functions:
- expand.grid(): we can use this to build a matrix of pair-wise combination values. We'll use these to generate the pair-wise column combinations we want to compare with cor().
- apply(): by now you should be familiar with this function. We'll use it to iterate through our pair-wise column combinations and send each of those to...
- cor(): this will return the correlation coefficient based on the method we choose (pearson, kendall, spearman)

# Example of how you can use expand.grid to make pairwise combinations between two sets of data.
expand.grid(c(1:3),c(4:6))
| Var1 | Var2 |
|---|---|
| <int> | <int> |
| 1 | 4 |
| 2 | 4 |
| 3 | 4 |
| 1 | 5 |
| 2 | 5 |
| 3 | 5 |
| 1 | 6 |
| 2 | 6 |
| 3 | 6 |
# First we need to fix the column names from our data
wyler_readcounts_only.df <-
wyler_readcounts.df %>%
# Rename the columns by removing the first portion: AECII_xx
rename_with(., ~ str_replace(string = .x,
pattern = r"(\w*_\d*_)",
replacement = "")) %>%
# Drop the first two columns as well
select(c(3:38))
# Create our correlation matrix
wyler_cor.mx <- matrix(apply(expand.grid(c(1:36),c(1:36)), # generate the pair-wise combinations
MARGIN = 1, # explore it row-by-row
function(x) cor(wyler_readcounts_only.df[, x[1]],
wyler_readcounts_only.df[, x[2]],
method = "pearson") # Pearson correlation
),
nrow = 36, # Cast our result as a matrix
dimnames = list(colnames(wyler_readcounts_only.df), # name the rows and columns
colnames(wyler_readcounts_only.df))
)
# take a look at the final matrix
head(wyler_cor.mx)
| 24h_DMSO-1 | 24h_DMSO-2 | 24h_DMSO-3 | 24h_200nM17AAG-1 | 24h_200nM17AAG-2 | 24h_200nM17AAG-3 | 24h_S2_DMSO-1 | 24h_S2_DMSO-2 | 24h_S2_DMSO-3 | 24h_S2_200nM17AAG-1 | ... | 72h_DMSO-3 | 72h_200nM17AAG-1 | 72h_200nM17AAG-2 | 72h_200nM17AAG-3 | 72h_S2_DMSO-1 | 72h_S2_DMSO-2 | 72h_S2_DMSO-3 | 72h_S2_200nM17AAG-1 | 72h_S2_200nM17AAG-2 | 72h_S2_200nM17AAG-3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24h_DMSO-1 | 1.0000000 | 0.9916027 | 0.9971539 | 0.8689104 | 0.8698985 | 0.8664966 | 0.9978794 | 0.9981499 | 0.9978888 | 0.8777276 | ... | 0.9187739 | 0.7915054 | 0.7936938 | 0.7961030 | 0.7952450 | 0.8135576 | 0.7894876 | 0.7982301 | 0.7965116 | 0.8005285 |
| 24h_DMSO-2 | 0.9916027 | 1.0000000 | 0.9977965 | 0.8985301 | 0.8982797 | 0.8951874 | 0.9927109 | 0.9941380 | 0.9903860 | 0.9033440 | ... | 0.9561826 | 0.8172553 | 0.8191693 | 0.8210994 | 0.8311898 | 0.8442450 | 0.8243365 | 0.8219231 | 0.8192079 | 0.8235471 |
| 24h_DMSO-3 | 0.9971539 | 0.9977965 | 1.0000000 | 0.8856510 | 0.8857412 | 0.8825415 | 0.9973488 | 0.9980146 | 0.9957036 | 0.8917670 | ... | 0.9404527 | 0.8046052 | 0.8065164 | 0.8088835 | 0.8157015 | 0.8315687 | 0.8095542 | 0.8098321 | 0.8078603 | 0.8121372 |
| 24h_200nM17AAG-1 | 0.8689104 | 0.8985301 | 0.8856510 | 1.0000000 | 0.9991436 | 0.9993164 | 0.8698156 | 0.8708104 | 0.8599358 | 0.9965957 | ... | 0.9259875 | 0.9071223 | 0.9061716 | 0.9071157 | 0.7929273 | 0.7942655 | 0.7833115 | 0.9038799 | 0.8992139 | 0.9050855 |
| 24h_200nM17AAG-2 | 0.8698985 | 0.8982797 | 0.8857412 | 0.9991436 | 1.0000000 | 0.9992593 | 0.8699711 | 0.8713050 | 0.8609952 | 0.9971129 | ... | 0.9236686 | 0.9030784 | 0.9022990 | 0.9031301 | 0.7878839 | 0.7896596 | 0.7781874 | 0.9010112 | 0.8955725 | 0.9018463 |
| 24h_200nM17AAG-3 | 0.8664966 | 0.8951874 | 0.8825415 | 0.9993164 | 0.9992593 | 1.0000000 | 0.8667858 | 0.8674853 | 0.8567663 | 0.9965608 | ... | 0.9231860 | 0.9081182 | 0.9071450 | 0.9082667 | 0.7861012 | 0.7870452 | 0.7760100 | 0.9054197 | 0.9006033 | 0.9066919 |
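As an aside, cor() will also accept a whole matrix (or numeric data frame) and return every pair-wise column correlation in one call, equivalent to the expand.grid()/apply() construction above. A small sketch on made-up data:

```r
# Toy matrix: three "samples" as columns
toy.mx <- matrix(c(1, 2, 3, 4, 5,      # s1
                   2, 4, 6, 8, 10,     # s2 = 2 * s1
                   5, 4, 3, 2, 1),     # s3 = s1 reversed
                 ncol = 3,
                 dimnames = list(NULL, c("s1", "s2", "s3")))

# One call yields the full symmetric correlation matrix
cor(toy.mx, method = "pearson")
# s1 vs s2 is +1 (perfect linear), s1 vs s3 is -1 (perfect inverse)
```

On our data, cor(wyler_readcounts_only.df, method = "pearson") should produce the same matrix as wyler_cor.mx; the longer version above just makes the pair-wise mechanics explicit.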
Now that we've completed the correlation matrix and it appears to be correct, we can generate the heatmap of the data, allowing it to group data based on the Pearson correlation values.
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=20)
# Create a heatmap of the correlation values
heatmap.2(x=wyler_cor.mx, # Apply our correlation matrix
main = "Heatmap of RNA-Seq Pearson correlation on readcounts",
col = viridis(100), # Use the viridis colour scheme
cexRow = 2, # Set row text size
cexCol = 2, # Set column text size
lhei = c(2,10), # Alter our key height slightly
lwid = c(3,10),
margins=c(20,20), # Increase text margins because our sample names can be long
trace = "none", # Get rid of the trace line down columns
dendrogram = "both" # Draw dendrograms for both rows and columns
)
### hclust()

As you can see, clustering our data makes a big difference in how our data is displayed. In our last few heatmaps, we allowed heatmap.2() to generate its own clustering. We could also have supplied it with a specific dendrogram to order the data. Under the hood, the function is actually calling on the hclust() function to produce the dendrograms.
The choice of method determines how the data is clustered. The choices include: ward.D, ward.D2, single, complete (default), average, mcquitty, median and centroid.
In general ward.D, ward.D2 and complete strategies try to find compact small "spherical" clusters. The single linkage adopts a 'friends of friends' clustering strategy. The other methods can be said to aim for somewhere in between. For more information on these, you can dig into the hclust() documentation
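To make the difference concrete, here's a toy run (five made-up 1-D points) where single and complete linkage agree on the early merges but diverge on the heights of the later ones:

```r
# Five points on a line: two tight pairs plus a distant outlier
pts <- matrix(c(1, 2, 10, 11, 30),
              dimnames = list(c("a", "b", "c", "d", "e"), NULL))
d <- dist(pts)   # euclidean distances between the points

# Single linkage merges clusters by their CLOSEST members...
hc_single <- hclust(d, method = "single")
hc_single$height     # 1 1 8 19

# ...while complete linkage uses their FARTHEST members,
# so the later merges happen at greater heights
hc_complete <- hclust(d, method = "complete")
hc_complete$height   # 1 1 10 29
```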
We can turn to hclust() to generate just dendrograms for us but prior to that we still need to reformat our matrix a little using the dist() function.
### dist()

Remember we talked about our data being in n-dimensional space? Using those coordinates, there are a number of ways to calculate the distance between points. The simplest form is to calculate the euclidean distance between points using the generic formula:

$$d(p,q) = \sqrt{(p_{1}-q_{1})^2 + (p_{2}-q_{2})^2 + \ldots + (p_{n}-q_{n})^2}$$

but there are a number of other options as well. For instance, you can also choose the maximum distance between two components (same dimension). The dist() function can generate these values for you, using the parameter method to determine how the distance will be calculated. Your options are: euclidean, maximum, manhattan, canberra, binary or minkowski.
The dist() function will return a dist object which is a lower triangle distance matrix between all of the rows/observations using the columns as coordinates.
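We can sanity-check the formula by hand against dist() with two arbitrary points (values made up for the example):

```r
# Two points in 3-dimensional space
p <- c(1, 2, 3)
q <- c(4, 6, 3)

# Euclidean distance straight from the formula
sqrt(sum((p - q)^2))
# 5

# dist() on a two-row matrix returns the same value
as.numeric(dist(rbind(p, q), method = "euclidean"))
# 5
```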
Let's return to our PHU data to try it out and see how it works.
# The default method is euclidean
dist(covid_demographics_norm.mx[,-1],
     method = "euclidean") %>%
  str()
'dist' num [1:561] 4365 6804 2961 3783 2579 ... - attr(*, "Size")= int 34 - attr(*, "Labels")= chr [1:34] "Algoma" "Brant County" "Durham Region" "Grey Bruce" ... - attr(*, "Diag")= logi FALSE - attr(*, "Upper")= logi FALSE - attr(*, "method")= chr "euclidean" - attr(*, "call")= language dist(x = covid_demographics_norm.mx[, -1], method = "euclidean")
Now we are ready to cluster our distance matrix using the hclust() method
Parameters that are important:
- method: already described as the method we want to use to cluster our data.
- k: the number of groups we would like our final data to be split into, represented by colours.
# Make a cluster object and take a look at it
phu.hc <- hclust(dist(covid_demographics_norm.mx[,-1]), method = "ward.D2")
# What does our hclust object contain?
str(phu.hc)
# phu.hc$merge
List of 7 $ merge : int [1:33, 1:2] -13 -9 -18 -25 -3 -2 -4 -7 -5 -29 ... $ height : num [1:33] 870 1178 1304 1364 1433 ... $ order : int [1:34] 1 4 9 21 6 13 20 28 16 23 ... $ labels : chr [1:34] "Algoma" "Brant County" "Durham Region" "Grey Bruce" ... $ method : chr "ward.D2" $ call : language hclust(d = dist(covid_demographics_norm.mx[, -1]), method = "ward.D2") $ dist.method: chr "euclidean" - attr(*, "class")= chr "hclust"
fviz_dend() from the package factoextra¶Now that we have generated our clustering object, you can see that it holds a number of features including the height (length) of our branches, an ordering of our PHUs, and the order in which they merge. The merge process is described in more detail in the hclust() documentation.
To simplify the plotting process, we can use the fviz_dend() function which will parse through the hclust object to produce a proper dendrogram. We can even specify some grouping or colouring schemes based on how many groups, k, we believe are present in our data.
We can treat the resulting plot like a ggplot to update particular elements as well.
# Plot out our dendrogram with fviz_dend
fviz_dend(phu.hc, # provide our hclust object
cex = 1.5, # set the text size to be larger
k = 3, # How many clusters do we expect in our data
palette = "Set1", # what colours will we use. Many options including RBrewer palettes
horiz = TRUE,
labels_track_height = 12000,
) +
# Bump up the text size for my old eyes
theme(text = element_text(size = 20))
Warning message: "`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead."
So we've visualized our dataset as a dendrogram and it gives us an idea of how the samples might be related (in n-dimension space) based on the values we produced from normalization.
K-means clustering is unsupervised learning for uncategorized data. We don't really know the training labels of our data and are more interested in seeing which ones group together. How many clusters should we aim to generate? We've already seen that our data likely splits into 3 groups based on the heatmaps and hclust() data. Will we get the same thing with a k-means method?
When in doubt, you can quickly consult the factoextra function fviz_nbclust(), which can estimate how many clusters are ideal for representing your data. Note that it may not produce your ideal number of clusters, but if you're stuck, this is a good start.
There are three method options: wss, gap_stat, and silhouette, each offering a different criterion for choosing the number of clusters.
wss (the elbow method) measures total intra-cluster variation: how compact is our clustering? A lower WSS is better.
silhouette measures how well an object lies within its cluster by comparing the average distance to the members of its own cluster, $a(i)$, against the distance to other clusters, $b(i)$. We evaluate $\frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$, with higher values indicating better clustering.
gap_stat also measures total intra-cluster variation but compares it against a null reference distribution. The optimal value is the smallest k such that the gap statistic at k is within one standard deviation of the gap at k+1.
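The wss criterion is easy to reproduce by hand: kmeans() reports tot.withinss, so looping over candidate values of k gives the elbow-curve data directly. A minimal sketch on made-up data (toy_data is invented here, with two obvious groups):

```r
set.seed(42)

# Two well-separated groups of 20 points each, made up for illustration
toy_data <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                  matrix(rnorm(40, mean = 5), ncol = 2))

# Record the total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy_data, centers = k, nstart = 10)$tot.withinss
})

# WSS drops sharply up to the "true" k (2 here), then flattens out
wss
```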
# How many clusters should we generate?
fviz_nbclust(covid_demographics_norm.mx[,-1], kmeans, method="wss") +
theme(text = element_text(size=20))
# How many clusters should we generate?
fviz_nbclust(covid_demographics_norm.mx[,-1], kmeans, method="silhouette") +
theme(text = element_text(size=20))
# How many clusters should we generate?
fviz_nbclust(covid_demographics_norm.mx[,-1], kmeans, method="gap_stat") +
theme(text = element_text(size=20))
kmeans()¶So from three different analyses we didn't really get a consensus on the best number of clusters, but it looks like the answer ranges between 1 and 4. Since our data already suggests 3 clusters, let's start with that. We'll use the kmeans() function to accomplish this.
Similar in idea to hclust(), the kmeans() algorithm attempts to partition the data into k groups, with an emphasis on minimizing the sum of squares from points to their assigned cluster centers. This differs from hclust(), which builds the entire relationship as a dendrogram without actually choosing "clusters".
We're going to also use set.seed() for our random number generation. We tend to think of things as random, but computationally, randomness is built on algorithms. Therefore we can "recreate" our randomness if we use a pre-determined starting point also known as the "seed".
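The effect of set.seed() is easy to demonstrate on its own: resetting the seed replays the same "random" stream, which is why re-running the kmeans() cell below gives identical results every time.

```r
set.seed(123)
first_draw <- rnorm(3)

set.seed(123)
second_draw <- rnorm(3)

# Identical because both draws started from the same seed
identical(first_draw, second_draw)
# TRUE
```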
The parameters we're interested in using with kmeans() are:
- x: the numeric matrix of our data.
- centers: the number of clusters we want to produce.
- nstart: how many randomly chosen sets of k centres to use to start the analysis. Choosing multiple different sets allows the algorithm to avoid local minima.
- iter.max: the maximum number of iterations allowed while trying to converge on the best minimum metric.
# Compute k-means with k = 3
# Set a seed for reproducibility.
set.seed(123)
# Generate our k-means analysis
phu_cases.km <-
kmeans(scale(covid_demographics_norm.mx[, -1]), # We'll scale our data for this (more on that later too!)
centers = 3,
nstart = 25,
iter.max = 500)
# phu_cases_unscaled.km <- kmeans(covid_demographics_norm.mx[,-1], 3, nstart = 25)
# What is the structure of our kmeans object?
str(phu_cases.km)
List of 9 $ cluster : Named int [1:34] 2 1 1 2 1 2 1 3 2 1 ... ..- attr(*, "names")= chr [1:34] "Algoma" "Brant County" "Durham Region" "Grey Bruce" ... $ centers : num [1:3, 1:7] 0.356 -0.955 0.82 0.349 -1.098 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : chr [1:3] "1" "2" "3" .. ..$ : chr [1:7] "0 to 4" "5 to 11" "12 to 19" "20 to 39" ... $ totss : num 231 $ withinss : num [1:3] 46.49 13.18 9.07 $ tot.withinss: num 68.7 $ betweenss : num 162 $ size : int [1:3] 18 11 5 $ iter : int 2 $ ifault : int 0 - attr(*, "class")= chr "kmeans"
# K-means clusters showing the group of each individual
phu_cases.km$cluster
fviz_cluster()¶Notice the structure of this output? It includes cluster, which records which of the 3 chosen clusters each row belongs to. The centre of each cluster is defined by centers, in this case a 3x7 array where each row uses 7 values to define its position within our 7-dimension dataset. We can see the cluster sizes are 18, 11, and 5 PHUs respectively. Notice, however, that none of the original coordinates for our data are retained in this object.
Rather than use ggplot directly, we'll use the fviz_cluster() function to help us transform the values of our points from a 7-dimension coordinate system to a 2-dimension visual format suitable for our simpler brains. Under the hood fviz_cluster() is performing a principal component analysis to project our data onto the first two principal components (more on what that means later).
The parameters we should be concerned with in this function include:
object the partitioning object for our data. In the case of kmeans, it is our kmeans object.
data is the original dataset we used to generate the clustering. This will be necessary to plot the data points.
ellipse.type determines how we'll outline our clusters. This comes with a number of options including:
- convex: a convex hull around the points of each cluster.
- norm, t or confidence: ellipses around each cluster, sized by ellipse.level, whose default is 0.95.
- euclid: circles using ellipse.level to set their radius around the centre of each cluster.
Let's see how our data has been partitioned by the k-means clustering.
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=15)
# Plot the cluster
# You can treat this object like a ggplot object afterwards
phu_kmeans.plot <-
fviz_cluster(object = phu_cases.km, # our k-means object
data = covid_demographics_norm.mx[, -1], # Our original data needed for PCA to visualize
ellipse.type = "convex",
ggtheme = theme_bw(),
repel=TRUE, # Try to avoid overlapping text
labelsize = 20,
pointsize = 4,
main = "K-means clustering of PHU by normalized case number"
) +
# Set some ggplot theme information
theme(text = element_text(size=20)) +
# Set the colour and fill scheme to viridis
scale_fill_viridis_d() +
scale_colour_viridis_d()
# Print the plot!
phu_kmeans.plot
Warning message: "ggrepel: 10 unlabeled data points (too many overlaps). Consider increasing max.overlaps"
Let's return to the Wyler et al., 2020 readcount data and take a look at it through a different lens. Whereas before we had generated a heatmap of the data based on looking at genes expressed within a readcount range, we'll now take a different approach.
Let's look at the top 500 variable genes in the dataset to help cluster them. To accomplish that we'll want to generate the standard deviation in read count for each row (gene). We'll utilize a few verbs we haven't used before in this class:
- rowwise(): in the same class of functions as group_by(), this will subtly alter the tibble so that when we perform math operations on it, these will be completed on a row-by-row basis.
- c_across(): this helper version of combine (c()) combines values across columns within our tibble.
# Review the wyler readcount data
head(wyler_readcounts.df)
| gene | width | AECII_01_24h_DMSO-1 | AECII_02_24h_DMSO-2 | AECII_03_24h_DMSO-3 | AECII_04_24h_200nM17AAG-1 | AECII_05_24h_200nM17AAG-2 | AECII_06_24h_200nM17AAG-3 | AECII_07_24h_S2_DMSO-1 | AECII_08_24h_S2_DMSO-2 | ... | AECII_27_72h_DMSO-3 | AECII_28_72h_200nM17AAG-1 | AECII_29_72h_200nM17AAG-2 | AECII_30_72h_200nM17AAG-3 | AECII_31_72h_S2_DMSO-1 | AECII_32_72h_S2_DMSO-2 | AECII_33_72h_S2_DMSO-3 | AECII_34_72h_S2_200nM17AAG-1 | AECII_35_72h_S2_200nM17AAG-2 | AECII_36_72h_S2_200nM17AAG-3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ... | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| A1BG | 1766 | 22 | 19 | 15 | 29 | 27 | 30 | 24 | 18 | ... | 29 | 32 | 41 | 33 | 24 | 33 | 23 | 22 | 28 | 43 |
| A1BG-AS1 | 2130 | 55 | 20 | 38 | 11 | 15 | 17 | 41 | 46 | ... | 53 | 57 | 49 | 37 | 58 | 63 | 47 | 39 | 26 | 34 |
| A1CF | 9529 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| A2M | 4653 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| A2M-AS1 | 2187 | 0 | 0 | 2 | 2 | 0 | 2 | 0 | 2 | ... | 4 | 2 | 2 | 3 | 1 | 3 | 2 | 1 | 2 | 3 |
| A2ML1 | 5441 | 2721 | 2349 | 2683 | 4740 | 3914 | 4220 | 2534 | 2484 | ... | 1230 | 1248 | 1050 | 857 | 958 | 1163 | 835 | 1174 | 866 | 1018 |
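Before applying these verbs to the full readcount table, here's a minimal sketch of rowwise() + c_across() on a made-up tibble (the gene and sample names are invented for this example):

```r
library(dplyr)

# A toy stand-in for a readcount table
toy <- tibble(gene = c("g1", "g2"),
              s1   = c(10, 100),
              s2   = c(12, 50),
              s3   = c(11, 150))

toy_sd <- toy %>%
  rowwise() %>%                            # operate one row at a time
  mutate(stdev = sd(c_across(s1:s3))) %>%  # sd across the sample columns only
  ungroup()

toy_sd$stdev
# 1 50
```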
# Save our results into a new dataframe
wyler_readcounts_filtered.df <-
wyler_readcounts.df %>%
# Rename the columns by removing the first portion: AECII_xx_
rename_with(., ~ str_replace(string = .x,
pattern = r"(\w*_\d*_)",
replacement = "")) %>%
# filter out the low readcount data. Low values will create wild variance easily
filter(if_all(.cols = -1, .fns = ~ .x > 10)) %>%
# Prepare to do row-wise calculations
rowwise() %>%
# Calculate the standard deviation across rows
mutate(stdev = sd(c_across(2:38))) %>%
# Ungroup and sort the data by descending value
ungroup() %>%
arrange(desc(stdev)) %>%
# Take the top 500 most variable genes
slice(1:500)
Now we'll just convert our dataframe into a matrix of values so we can perform our various analyses.
# Save just the RNA-Seq data into a matrix
wyler_readcounts.mx <- as.matrix(wyler_readcounts_filtered.df[, 3:38])
# Set the row names using the information from the data frame
rownames(wyler_readcounts.mx) <- wyler_readcounts_filtered.df$gene
# Take a quick look at the resulting matrix and its properties
head(wyler_readcounts.mx)
str(wyler_readcounts.mx)
| 24h_DMSO-1 | 24h_DMSO-2 | 24h_DMSO-3 | 24h_200nM17AAG-1 | 24h_200nM17AAG-2 | 24h_200nM17AAG-3 | 24h_S2_DMSO-1 | 24h_S2_DMSO-2 | 24h_S2_DMSO-3 | 24h_S2_200nM17AAG-1 | ... | 72h_DMSO-3 | 72h_200nM17AAG-1 | 72h_200nM17AAG-2 | 72h_200nM17AAG-3 | 72h_S2_DMSO-1 | 72h_S2_DMSO-2 | 72h_S2_DMSO-3 | 72h_S2_200nM17AAG-1 | 72h_S2_200nM17AAG-2 | 72h_S2_200nM17AAG-3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KRT6A | 129671 | 71949 | 100638 | 37054 | 33900 | 34311 | 109804 | 102597 | 124371 | 41077 | ... | 46340 | 4189 | 4187 | 3485 | 44290 | 51827 | 38740 | 4764 | 3735 | 4345 |
| EEF1A1 | 73954 | 55211 | 65183 | 115435 | 108071 | 116563 | 59765 | 56679 | 65284 | 109138 | ... | 93440 | 73011 | 69767 | 57666 | 45198 | 44648 | 38619 | 69210 | 53904 | 68927 |
| PSAP | 20807 | 15808 | 18643 | 47679 | 42159 | 45534 | 17508 | 16545 | 18473 | 45529 | ... | 39455 | 80703 | 75471 | 61679 | 30463 | 28862 | 26229 | 72179 | 57740 | 72614 |
| LAMC2 | 95498 | 54101 | 71731 | 62764 | 60630 | 64633 | 79476 | 77353 | 96356 | 70394 | ... | 49602 | 38868 | 37849 | 29147 | 54168 | 60389 | 48255 | 38729 | 31912 | 35696 |
| APP | 25604 | 18709 | 22107 | 46677 | 41708 | 47907 | 21038 | 19610 | 22815 | 44120 | ... | 39254 | 76234 | 72166 | 62192 | 33801 | 32410 | 28194 | 68815 | 56247 | 68841 |
| KRT14 | 66673 | 41095 | 56114 | 81476 | 75289 | 80292 | 58457 | 51209 | 61860 | 76687 | ... | 39070 | 31069 | 27874 | 24329 | 38139 | 42721 | 34589 | 27887 | 22419 | 28122 |
num [1:500, 1:36] 129671 73954 20807 95498 25604 ... - attr(*, "dimnames")=List of 2 ..$ : chr [1:500] "KRT6A" "EEF1A1" "PSAP" "LAMC2" ... ..$ : chr [1:36] "24h_DMSO-1" "24h_DMSO-2" "24h_DMSO-3" "24h_200nM17AAG-1" ...
Since we're here, let's take a quick step back and look at the heatmap from our new dataset. Does it reveal anything to us now that it is more nuanced?
# Create a heatmap of the filtered readcount values
heatmap.2(x=wyler_readcounts.mx, # Apply our matrix of the 500 most variable genes
main = "Heatmap of RNA-Seq readcounts",
col = viridis(100), # Use the viridis colour scheme
cexRow = 2, # Set row text size
cexCol = 2, # Set column text size
lhei = c(2,10), # Alter our key height slightly
lwid = c(3,10),
margins=c(20,20), # Increase text margins because our sample names can be long
trace = "none", # Get rid of the trace line down columns
dendrogram = "both" # Optional to turn off warnings
)
Looks like the heatmap still doesn't reveal much about how samples might be grouped. Let's proceed with k-means and see if that works better. We just need to repeat our steps on a different set of data. Since we now have our most variable genes and their readcounts, let's see if we can cluster this information.
# How many clusters should we generate?
fviz_nbclust(t(wyler_readcounts.mx), kmeans, method="wss")
fviz_nbclust(t(wyler_readcounts.mx), kmeans, method="silhouette")
fviz_nbclust(t(wyler_readcounts.mx), kmeans, method="gap_stat")
# Compute k-means with k = 4
# Set a seed for reproducibility.
set.seed(123)
# Generate our k-means analysis
wyler_readcounts.km <-
kmeans(scale(t(wyler_readcounts.mx)), # We'll scale our data for this (more on that later too!)
centers = 4,
nstart = 25,
iter.max = 500)
# Take a look at the resulting k-means object
str(wyler_readcounts.km)
List of 9 $ cluster : Named int [1:36] 3 3 3 4 4 4 3 3 3 4 ... ..- attr(*, "names")= chr [1:36] "24h_DMSO-1" "24h_DMSO-2" "24h_DMSO-3" "24h_200nM17AAG-1" ... $ centers : num [1:4, 1:500] 0.378 -1.069 0.996 -0.233 -1.399 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : chr [1:4] "1" "2" "3" "4" .. ..$ : chr [1:500] "KRT6A" "EEF1A1" "PSAP" "LAMC2" ... $ totss : num 17500 $ withinss : num [1:4] 1165 1661 1975 215 $ tot.withinss: num 5016 $ betweenss : num 12484 $ size : int [1:4] 6 12 12 6 $ iter : int 2 $ ifault : int 0 - attr(*, "class")= chr "kmeans"
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=15)
# Plot the cluster
# You can treat this object like a ggplot object afterwards
wyler_kmeans.plot <-
fviz_cluster(object = wyler_readcounts.km, # our k-means object
data = t(wyler_readcounts.mx), # Our original data needed for PCA to visualize
ellipse.type = "convex",
ggtheme = theme_bw(),
repel=TRUE, # Try to avoid overlapping text
labelsize = 20,
pointsize = 4,
main = "K-means clustering of filtered Wyler readcount data"
) +
# Set some ggplot theme information
theme(text = element_text(size=20)) +
# Set the colour and fill scheme to viridis
scale_fill_viridis_d() +
scale_colour_viridis_d()
# Print the plot!
wyler_kmeans.plot
Warning message: "ggrepel: 9 unlabeled data points (too many overlaps). Consider increasing max.overlaps"
Looking at the result, what's nice about this kind of output is that we can immediately see the separation or clustering of our data. It looks like 4 clusters was the way to go, as there are 4 tight groupings in our data. It might be possible to further subdivide the group in the top left of our figure, but it seems unlikely to be meaningful.
Based on our selection of genes, we observe:
These groupings suggest that the 24-hour SARS-CoV-2-infected timepoint is similar to uninfected controls, while the 48- and 72-hour SARS-CoV-2-infected samples resemble each other but show distinct transcriptional changes due to infection. Treatment with the HSP90 inhibitor, however, appears to alter the expression profiles of infected and mock-infected cells enough to make them cluster together. This would be a good starting point to begin digging further into the data.
| Yes the data clusters, but why does it cluster? |
One problem we encountered in our above analysis of RNA-Seq data is that there were simply too many dimensions! With ~27k gene entries in our dataframe, using clustering to analyse the entire dataset would be memory-intensive if not impossible altogether. To circumvent this problem we attempted to subset our data in two ways - filtering by read counts, and comparing variance across datasets. These approaches were a form of dimensionality reduction - namely feature elimination. Our choices, however, may have inadvertently been throwing away data that could prove insightful! This is where other dimensionality reduction methods can help guide our analyses.
You can achieve dimensionality reduction in three general ways:
| Technique | Description | Type | Useful for |
|---|---|---|---|
| Random forest | A popular machine learning algorithm for classification that randomly chooses subsets of features to classify data. The most frequently appearing features amongst the trees of the forest are chosen. | Feature selection | Only takes numeric inputs |
| PCA | Principal component analysis attempts to maximize variation when transforming to a lower-dimensional space. All new features are independent of one another. | Linear Feature Extraction | Actually really good at finding outliers in RNAseq data |
| t-SNE | t-Distributed Stochastic Neighbour Embedding is a non-linear technique suitable for high-dimension datasets. It converts pairwise similarities into probabilities and tries to preserve nearest-neighbour relationships when transforming from high- to low-dimension spaces. | Non-linear feature extraction | Determining if your data has underlying structure |
| UMAP | Uniform manifold approximation and projection is projection-based like t-SNE. It attempts to preserve local data structure but with improved retention of global data structure. | Non-linear feature extraction | Faster than t-SNE |
Since we're not exactly building a classifier but rather trying to find trends in our data, we won't be looking at random forests here. We are interested in exploratory data analysis: we have a dataset we want to understand, and sometimes it is too complex to simply divine 1) the underlying structure and 2) the features that drive that structure.
Going back to our PHU data, we used 7 dimensions to classify our data but what if we wanted to transform that information in some way to reduce the number of dimensions needed to represent their relationships? For our data set, 7 dimensions isn't a lot of features but in other cases (like RNA-Seq) you might encounter feature-dense datasets and without knowing a priori which ones are meaningful, PCA provides a path forward in classifying our data.
Now there are caveats to PCA. It performs linear feature extraction by building new features in your n-dimensional space such that each new feature (kind of like a projected line) captures the maximum remaining variance of the original data points. Each new component must be uncorrelated with (i.e. orthogonal to) the previous ones while accounting for the next highest amount of variance. More resources on how this works are in the references.
How do we go about maximizing the variance (spread of red dots) of our data features? The projection where variance is maximized also minimizes error (red line lengths). Generated by user Amoeba on stackexchange.com
All math aside, our goal is to reduce our feature set to something smaller by trying to represent our data with these new features. Just remember that highly variable data and outliers can dominate your principal components.
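As a base-R sketch of the idea (using prcomp() rather than the FactoMineR PCA() we'll use below, purely for illustration): two perfectly correlated toy features collapse onto a single principal component that carries essentially all of the variance.

```r
# Two made-up features that are perfectly correlated
x <- c(1, 2, 3, 4, 5)
toy <- cbind(f1 = x, f2 = 2 * x)

p <- prcomp(toy, scale. = TRUE)

# Proportion of variance explained by each new component
summary(p)$importance["Proportion of Variance", ]
# PC1 explains ~100% of the variance; PC2 explains essentially none
```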
For simplicity let's head back and look at our PHU age group data again. To illustrate our example with PCA, let's use the original data normalized by PHU population.
head(covid_demographics_norm.mx)
| population_2022 | 0 to 4 | 5 to 11 | 12 to 19 | 20 to 39 | 40 to 59 | 60 to 79 | 80+ | |
|---|---|---|---|---|---|---|---|---|
| Algoma | 117840 | 3302 | 4890 | 4464 | 7570 | 4301 | 2331 | 4116 |
| Brant County | 153558 | 4481 | 5376 | 6499 | 9448 | 6429 | 3995 | 5684 |
| Durham Region | 711426 | 4669 | 5697 | 6407 | 11018 | 7333 | 4985 | 7562 |
| Grey Bruce | 176150 | 2324 | 3040 | 4643 | 5995 | 3489 | 2022 | 3056 |
| Haldimand-Norfolk | 120008 | 3002 | 4956 | 5388 | 9803 | 5799 | 3568 | 6261 |
| Haliburton, Kawartha, Pine Ridge District | 190728 | 2263 | 2865 | 3673 | 7368 | 3681 | 1923 | 3612 |
PCA() to generate our analysis¶To generate our analysis of the PHU data, we'll use the FactoMineR function PCA() for which there are some parameters we'll be using that we should discuss:
- X: a data frame of n rows (observations) and p columns (numeric variables).
- ncp: the number of dimensions kept in the results (5 is the default).
- scale.unit: a boolean to scale your data (TRUE by default).
Remember that PCA is trying to maximize the variance captured from the observations supplied along each principal component. Depending on the nature of your variables you may have, for instance, two different unit types like height and mass. Smaller changes in height may be matched with much larger changes in mass, or simply wider overall variance. This may lead the PCA algorithm to prioritize mass over height when you'd prefer they have equal importance. By centering on the mean and scaling to unit variance, everything is compared as a z-score, bringing the overall variance of a variable to within ~3 standard deviations.
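The z-scoring that scale.unit performs can be sketched with base R's scale(): after centering and scaling, every column has mean 0 and standard deviation 1, so no variable dominates purely because of its units (the height/mass values below are invented):

```r
# Made-up height (m) and mass (kg) measurements on very different scales
toy <- cbind(height = c(1.6, 1.7, 1.8, 1.9),
             mass   = c(55, 70, 85, 100))

z <- scale(toy)  # centre each column, then divide by its standard deviation

colMeans(z)      # ~0 for both columns
apply(z, 2, sd)  # 1 for both columns
```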
Let's compare PCA with and without scaling shall we?
# Build a PCA of our PHU data with scaling applied
phu_scaled.pca <- PCA(covid_demographics_norm.mx[, -1],
scale.unit = TRUE, # What happens when we don't scale the data?
ncp = 7,
graph = TRUE)
# Build a PCA of our PHU data WITHOUT scaling applied
phu_unscaled.pca <- PCA(covid_demographics_norm.mx[,-1],
scale.unit = FALSE,
ncp = 7,
graph = TRUE)
Warning message: "ggrepel: 9 unlabeled data points (too many overlaps). Consider increasing max.overlaps" Warning message: "ggrepel: 17 unlabeled data points (too many overlaps). Consider increasing max.overlaps"
# Take a look at the information inside our PCA object
print(phu_scaled.pca)
**Results for the Principal Component Analysis (PCA)** The analysis was performed on 34 individuals, described by 7 variables *The results are available in the following objects: name description 1 "$eig" "eigenvalues" 2 "$var" "results for the variables" 3 "$var$coord" "coord. for the variables" 4 "$var$cor" "correlations variables - dimensions" 5 "$var$cos2" "cos2 for the variables" 6 "$var$contrib" "contributions of the variables" 7 "$ind" "results for the individuals" 8 "$ind$coord" "coord. for the individuals" 9 "$ind$cos2" "cos2 for the individuals" 10 "$ind$contrib" "contributions of the individuals" 11 "$call" "summary statistics" 12 "$call$centre" "mean of the variables" 13 "$call$ecart.type" "standard error of the variables" 14 "$call$row.w" "weights for the individuals" 15 "$call$col.w" "weights for the variables"
Regardless of scaling, the result of our PCA() call produces an object with many variables we can access. Above you can see a brief description for each variable but we are most interested in a few particular ones:
- eig holds our dimensional data and describes just how much each new principal component explains of the overall variation in our data.
- var holds the results for the variables. We can use these to graph and visualize our data.
- ind$coord will allow us to plot the coordinates of our observations along the principal components.
The eigenvalues from our analysis pair with the eigenvectors (principal components) to help transform our data from the original feature set to the new set of features. While the eigenvectors determine the directions of the new feature space, the eigenvalues represent the magnitude along each vector and, in this case, can be used to calculate the percent of overall variance explained by each eigenvector.
The important take-away is that we can now see just how much of our variance is explained in each new principle component. We can access this information directly from phu_scaled.pca or by using the function get_eigenvalue(). We can also plot this as a barchart instead using fviz_eig().
# Look at our eigenvalues directly
phu_scaled.pca$eig
# Use get_eigenvalue() to look at our eigenvalues
get_eigenvalue(phu_scaled.pca)
| eigenvalue | percentage of variance | cumulative percentage of variance | |
|---|---|---|---|
| comp 1 | 5.57079932 | 79.5828474 | 79.58285 |
| comp 2 | 0.89375555 | 12.7679364 | 92.35078 |
| comp 3 | 0.26020469 | 3.7172099 | 96.06799 |
| comp 4 | 0.12869297 | 1.8384710 | 97.90646 |
| comp 5 | 0.08966823 | 1.2809747 | 99.18744 |
| comp 6 | 0.04430641 | 0.6329488 | 99.82039 |
| comp 7 | 0.01257282 | 0.1796117 | 100.00000 |
| eigenvalue | variance.percent | cumulative.variance.percent | |
|---|---|---|---|
| Dim.1 | 5.57079932 | 79.5828474 | 79.58285 |
| Dim.2 | 0.89375555 | 12.7679364 | 92.35078 |
| Dim.3 | 0.26020469 | 3.7172099 | 96.06799 |
| Dim.4 | 0.12869297 | 1.8384710 | 97.90646 |
| Dim.5 | 0.08966823 | 1.2809747 | 99.18744 |
| Dim.6 | 0.04430641 | 0.6329488 | 99.82039 |
| Dim.7 | 0.01257282 | 0.1796117 | 100.00000 |
See how nearly 80% of our variation is explained in just our first dimension? Let's use fviz_eig() to display this information visually.
# Adjust our plot window size according to the expected output
options(repr.plot.width=10, repr.plot.height=6)
# Visualize the impact of our eigenvalues
fviz_eig(phu_scaled.pca, addlabels = TRUE) + theme(text = element_text(size=20))
Looking at our eigenvalues we can see that even though 7 new PCs were generated, the first two explain almost 92% of our variance. Whatever linear separation exists in our data, we can largely recreate it in a two-dimensional projection of coordinates along PC1 and PC2, meaning we can capture the underlying structure of our data with these two new features instead of a 7-dimensional space!
This makes some sense when we take a closer look at the data. For instance, there isn't much visual information found in our 0 to 4 age range. The small distribution of information in these features may not do much, in the grand scheme, to change how the PHUs are separated from each other. Then again, perhaps they do contain information that we simply cannot see easily! How will we know?
get_pca_var()¶Within our PCA, we can access information regarding how our original variables are transformed into the new space. This is all stored in the var element of our PCA object. We can extract aspects of this using get_pca_var() and visualize them to determine the quality of our variables and how well they are represented by the new principal components.
# What is the information associated with our original variables
phu.var <- get_pca_var(phu_scaled.pca)
phu.var
Principal Component Analysis Results for variables =================================================== Name Description 1 "$coord" "Coordinates for the variables" 2 "$cor" "Correlations between variables and dimensions" 3 "$cos2" "Cos2 for the variables" 4 "$contrib" "contributions of the variables"
Each original variable may, to some degree, influence the amount of information captured by each new principal component. Looking at the breakdown per variable can also reveal where more or less of the variation is found.
# What is the contribution of each variable?
phu.var$contrib
| Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 | Dim.6 | Dim.7 | |
|---|---|---|---|---|---|---|---|
| 0 to 4 | 10.34590 | 40.9538402 | 6.80231618 | 13.164244 | 2.341191e+01 | 4.1598042 | 1.1619876 |
| 5 to 11 | 13.92950 | 15.5247689 | 13.94117607 | 6.128455 | 4.187471e+01 | 7.7830343 | 0.8183575 |
| 12 to 19 | 15.22164 | 2.5003608 | 22.64386889 | 46.614495 | 1.175802e+01 | 0.2326117 | 1.0290033 |
| 20 to 39 | 16.20731 | 0.2393659 | 20.00827009 | 8.205234 | 1.759244e+01 | 37.3147126 | 0.4326714 |
| 40 to 59 | 16.52966 | 5.7071663 | 2.79224811 | 4.688363 | 6.551208e-04 | 18.9749394 | 51.3069659 |
| 60 to 79 | 15.84305 | 9.6533637 | 0.06091353 | 12.485206 | 3.162872e+00 | 14.7586453 | 44.0359452 |
| 80+ | 11.92294 | 25.4211341 | 33.75120714 | 8.714002 | 2.199400e+00 | 16.7762526 | 1.2150691 |
From above we can see that our original variables contribute roughly equally to our first principal component, but there is much more contribution to PC2 from three variables: 0 to 4, 5 to 11 and 80+. From our previous scree plot, that means the 12.8% of overall variation described by PC2 comes mainly from these three groups.
From the remaining variable information, coord and cor can both be used to plot our original variables along different pairs of principal components. These values are the same because this is not a projection of our variables onto the PCs but a correlation with them! The distance between the origin and the variable coordinates gauges the quality of the variables on the map. This number is summarized in the cos2 (squared cosine) element.
# Look at the coordinates of our variables across the PCs.
phu.var$coord
phu.var$cor
| Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 | Dim.6 | Dim.7 | |
|---|---|---|---|---|---|---|---|
| 0 to 4 | 0.7591766 | 0.60500184 | 0.13304114 | -0.13015935 | 0.1448897723 | 0.04293088 | 0.012086961 |
| 5 to 11 | 0.8808998 | 0.37249629 | 0.19046153 | 0.08880817 | -0.1937738660 | -0.05872294 | -0.010143502 |
| 12 to 19 | 0.9208514 | 0.14948951 | -0.24273527 | 0.24492770 | 0.1026801084 | -0.01015194 | -0.011374302 |
| 20 to 39 | 0.9501982 | -0.04625307 | -0.22817199 | -0.10275972 | -0.1255978821 | 0.12857998 | -0.007375568 |
| 40 to 59 | 0.9596011 | -0.22584976 | -0.08523826 | -0.07767621 | -0.0007664433 | -0.09169032 | 0.080316458 |
| 60 to 79 | 0.9394598 | -0.29373027 | 0.01258967 | -0.12675797 | 0.0532549688 | -0.08086425 | -0.074408070 |
| 80+ | 0.8149864 | -0.47665795 | 0.29634815 | 0.10589763 | 0.0444090428 | 0.08621459 | 0.012359954 |
| Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 | Dim.6 | Dim.7 | |
|---|---|---|---|---|---|---|---|
| 0 to 4 | 0.7591766 | 0.60500184 | 0.13304114 | -0.13015935 | 0.1448897723 | 0.04293088 | 0.012086961 |
| 5 to 11 | 0.8808998 | 0.37249629 | 0.19046153 | 0.08880817 | -0.1937738660 | -0.05872294 | -0.010143502 |
| 12 to 19 | 0.9208514 | 0.14948951 | -0.24273527 | 0.24492770 | 0.1026801084 | -0.01015194 | -0.011374302 |
| 20 to 39 | 0.9501982 | -0.04625307 | -0.22817199 | -0.10275972 | -0.1255978821 | 0.12857998 | -0.007375568 |
| 40 to 59 | 0.9596011 | -0.22584976 | -0.08523826 | -0.07767621 | -0.0007664433 | -0.09169032 | 0.080316458 |
| 60 to 79 | 0.9394598 | -0.29373027 | 0.01258967 | -0.12675797 | 0.0532549688 | -0.08086425 | -0.074408070 |
| 80+ | 0.8149864 | -0.47665795 | 0.29634815 | 0.10589763 | 0.0444090428 | 0.08621459 | 0.012359954 |
# What is the quality of our variables?
# The higher the value the better!
phu.var$cos2
| | Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 | Dim.6 | Dim.7 |
|---|---|---|---|---|---|---|---|
| 0 to 4 | 0.5763492 | 0.366027221 | 0.0176999458 | 0.016941457 | 2.099305e-02 | 0.0018430600 | 1.460946e-04 |
| 5 to 11 | 0.7759844 | 0.138753484 | 0.0362755942 | 0.007886891 | 3.754831e-02 | 0.0034483833 | 1.028906e-04 |
| 12 to 19 | 0.8479673 | 0.022347113 | 0.0589204091 | 0.059989580 | 1.054320e-02 | 0.0001030619 | 1.293748e-04 |
| 20 to 39 | 0.9028766 | 0.002139346 | 0.0520624574 | 0.010559560 | 1.577483e-02 | 0.0165328107 | 5.439901e-05 |
| 40 to 59 | 0.9208343 | 0.051008116 | 0.0072655606 | 0.006033594 | 5.874353e-07 | 0.0084071150 | 6.450733e-03 |
| 60 to 79 | 0.8825848 | 0.086277474 | 0.0001584999 | 0.016067582 | 2.836092e-03 | 0.0065390264 | 5.536561e-03 |
| 80+ | 0.6642028 | 0.227202798 | 0.0878222244 | 0.011214308 | 1.972163e-03 | 0.0074329558 | 1.527685e-04 |
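If you'd like to see where those `cos2` values come from, here is a minimal sketch using base R's `prcomp()` on the built-in `mtcars` data (standing in for our PHU matrix): `cos2` is just the squared correlation between each original variable and each component.

```r
# cos2 sketched by hand: squared correlation of variables with components
pca <- prcomp(mtcars, scale. = TRUE)

# Correlate each original (scaled) variable with each principal component
var_cor  <- cor(scale(mtcars), pca$x)
var_cos2 <- var_cor^2  # squared correlation = quality of representation

# Across ALL components, each variable's cos2 values sum to 1
rowSums(var_cos2)
```

This also explains why a variable's `cos2` values across Dim.1 to Dim.7 in the table above sum to 1.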
`fviz_pca_var()`

To capture all of our information in a single visualization we can use `fviz_pca_var()` to assess our variables. High-quality variables will sit closer to the circumference of the circle; low-quality ones will sit closer to the origin. As we saw above, most of our variables are positively correlated with PC1, and across PC1/PC2 our variables correlate highly.
# Compare how our variables contribute and correlate with PC1/PC2
fviz_pca_var(phu_scaled.pca,
col.var = "contrib", # How will we colour our data/lines
gradient.cols = c("green", "yellow", "red"),
labelsize = 10,
repel = TRUE, # make sure text doesn't overlap
axes = c(1,2) # Determine which PCs you want to graph
) +
theme(text = element_text(size=20))
# Compare how our variables contribute and correlate with PC2/PC3
fviz_pca_var(phu_scaled.pca,
col.var = "contrib", # How will we colour our data/lines
gradient.cols = c("green", "yellow", "red"),
labelsize = 10,
repel = TRUE, # make sure text doesn't overlap
axes = c(2,3) # Determine which PCs you want to graph
) +
theme(text = element_text(size=20))
`fviz_pca_ind()`

Now that we've generated our PCA, let's see if there is any clear separation between our PHUs across the first two dimensions of new features. Much like our variables, all of the individual coordinate data can be found in the `ind` element of our PCA object. Within it we'll find the `coord` information, so we could directly build a ggplot with `phu_scaled.pca$ind$coord` BUT there's a simpler way.
We'll parse through the PCA object with `fviz_pca_ind()` to plot our data along the first two principal components. We can define `pointsize` and `col.ind` to help add some population size information to our PHUs.
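For the curious, the "manual" route mentioned above looks something like this. It's a sketch using base R's `prcomp()` on `mtcars` so it stays self-contained; with FactoMineR, the coordinates would come from `phu_scaled.pca$ind$coord` instead of `pca$x`. It assumes `ggplot2` is loaded, as it is throughout this course.

```r
library(ggplot2)

# prcomp()'s score matrix ($x) plays the role of FactoMineR's $ind$coord
pca <- prcomp(mtcars, scale. = TRUE)
ind <- as.data.frame(pca$x[, 1:2])
ind$label <- rownames(mtcars)

# Individuals plotted by hand along the first two components
p <- ggplot(ind, aes(x = PC1, y = PC2, label = label)) +
  geom_point() +
  geom_text(vjust = -1)
p
```

`fviz_pca_ind()` wraps this up for us and adds the percentage of variance to the axis labels automatically.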
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=12)
# Graph our scaled PCA data.
fviz_pca_ind(phu_scaled.pca,
pointsize = covid_demographics_norm.mx[,1], # Set point size and colour by PHU population
col.ind = log10(covid_demographics_norm.mx[,1]),
repel = TRUE, # avoid overlapping text points
labelsize = 7,
axes = c(1,2)
) +
theme(text = element_text(size=20)) + # Make our text larger
scale_size(range = c(3, 10)) + # Update the point size range
scale_colour_viridis_c() # Change the colour scheme
Warning message: "ggrepel: 12 unlabeled data points (too many overlaps). Consider increasing max.overlaps"
# Graph our UNscaled PCA data.
fviz_pca_ind(...,
pointsize = covid_demographics_norm.mx[,1], # Set point size and colour by PHU population
col.ind = log10(covid_demographics_norm.mx[,1]),
repel = TRUE, # avoid overlapping text points
labelsize = 7,
axes = c(1,2)
) +
theme(text = element_text(size=20)) + # Make our text larger
scale_size(range = c(3, 10)) + # Update the point size range
scale_colour_viridis_c() # Change the colour scheme
# What are the contributions of each age group to the various unscaled dimensions
phu_unscaled.pca$var$contrib
# Look at our unscaled eigenvalues directly
phu_unscaled.pca$eig
As you can see from graphing both PCA sets, something more drastic happens when we choose not to scale our data. PC1's portion of variance increases to 83.2% of overall variance. Many of our points move, and the separation between some of our PHUs increases (e.g. Kingston/Frontenac and Peel). These changes are likely due to the larger range of values in the 20 to 39 age group, which now contributes 31% of PC1's variation.
So think carefully about your features and what they represent. Depending on their ranges and units, you may be inadvertently weighting some features more than others! Scaling adjusts your data so that all features are weighted equally.
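A toy demonstration of this point, using two synthetic features with the same shape but one measured on a 1000x larger scale:

```r
# Unscaled PCA lets the wide-ranged feature dominate PC1
set.seed(1)
small <- rnorm(100, sd = 1)     # unit-scale feature
big   <- rnorm(100, sd = 1000)  # same shape, 1000x the spread
m <- cbind(small, big)

unscaled <- summary(prcomp(m, scale. = FALSE))
scaled   <- summary(prcomp(m, scale. = TRUE))

# PC1's share of variance: near 100% unscaled, near 50% once scaled
unscaled$importance["Proportion of Variance", "PC1"]
scaled$importance["Proportion of Variance", "PC1"]
```

Without scaling, PC1 is essentially just the `big` feature; scaling restores the two features to equal footing.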
Now that we've effectively reduced the complexity of our data, can we reproduce the same k-means clustering as before? Let's try clustering on just the two dimensions we have already graphed!
# How many clusters should we use based on our PCA data?
fviz_nbclust(phu_scaled.pca$ind$coord[,1:2], kmeans, method="wss")
fviz_nbclust(phu_scaled.pca$ind$coord[,1:2], kmeans, method="silhouette")
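As I understand it, the "wss" curve that `fviz_nbclust()` draws is simply the total within-cluster sum of squares from `kmeans()` across a range of k. Here is a sketch of that computation on synthetic data:

```r
# The "wss" elbow curve by hand: tot.withinss for each candidate k
set.seed(123)
m <- matrix(rnorm(100), ncol = 2)  # 50 toy points in 2D

wss <- sapply(1:6, function(k) {
  kmeans(m, centers = k, nstart = 10)$tot.withinss
})
wss  # look for the "elbow" where the decrease flattens out
```

The silhouette method instead scores how well each point sits inside its assigned cluster versus the next-closest one, and picks the k with the best average score.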
# Generate our kmeans analysis and visualize it
set.seed(123)
phu_PCA_kmeans.plot <-
kmeans(..., centers = 3) %>%
fviz_cluster(., # our k-means object
data = ..., # Our original data needed for PCA to visualize
ellipse.type = "convex",
ggtheme = theme_bw(),
repel=TRUE, # Try to avoid overlapping text
labelsize = 20,
pointsize = 4,
main = "K-means clustering of PHU by normalized case number AFTER Principal Component Analysis"
) +
# Set some ggplot theme information
theme(text = element_text(size=20)) +
# Set the colour and fill scheme to viridis
scale_fill_viridis_d() +
scale_colour_viridis_d()
# phu_PCA_kmeans.plot
# Plot both of our figures together
# knitr chunk options (no effect in Jupyter): fig.show="hold"; out.width="50%"; out.height="50%"
phu_kmeans.plot
phu_PCA_kmeans.plot
From our estimations, the clustering choices look nearly identical. This small dataset isn't the best showcase, but notice that we have now transformed our data into just 2 features! Could we have done the same using just 2 features from our original dataset? No! So imagine transforming a much larger, feature-heavy dataset into a small number of features. We can also extract, from the data, which of our original features contribute most to each new principal component!
Something to consider the next time you're working with a large data set!
Now that we've walked through how to generate a PCA for a simpler dataset, let's return to our RNA-Seq readcount data. We won't trim down the data at all but rather just provide the entire dataset as a matrix for feature reduction. Let's review the steps:
# 1. Generate our matrix from the readcount data
wyler_readcounts_all.mx <-
wyler_readcounts.df %>%
# Rename the columns by removing the first portion: AECII_xx
rename_with(., ~ str_replace(string = .x,
pattern = r"(\w*_\d*_)",
replacement = "")) %>%
select(c(3:38)) %>%
as.matrix() %>%
...
# Name the rows of the matrix
colnames(wyler_readcounts_all.mx) <- wyler_readcounts.df$gene
head(wyler_readcounts_all.mx)
# 2. Generate the PCA object
readcounts_scaled.pca <- PCA(...,
scale.unit = TRUE, # What happens when we don't scale the data?
ncp = 40,
graph = TRUE)
# 3. Review how much variance is explained by components
readcounts_scaled.pca$eig
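The `$eig` table is built directly from the eigenvalues: each eigenvalue's share of the total is that component's percentage of variance. A sketch with base R's `prcomp()` on `mtcars` (the squared standard deviations of the components are the eigenvalues):

```r
# Rebuild an $eig-style table by hand from prcomp()
pca <- prcomp(mtcars, scale. = TRUE)
eig <- pca$sdev^2             # eigenvalues
pct <- 100 * eig / sum(eig)   # percentage of variance per component

cbind(eigenvalue = eig, percent = pct, cumulative = cumsum(pct))
```

Scanning the cumulative column tells you how many components you need to retain a given fraction of the original variance.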
# 4. How many clusters should we use based on our PCA data?
fviz_nbclust(readcounts_scaled.pca$ind$coord[,1:2], kmeans, method="silhouette")
# 5. Generate our kmeans analysis and visualize it
set.seed(123)
readcounts_PCA_kmeans.plot <-
# Let's go with the suggested silhouette number of 5 clusters
kmeans(readcounts_scaled.pca$ind$coord[,1:2], centers = ...) %>%
fviz_cluster(., # our k-means object
data = readcounts_scaled.pca$ind$coord[,1:2], # Our original data needed for PCA to visualize
ellipse.type = "convex",
ggtheme = theme_bw(),
repel=TRUE, # Try to avoid overlapping text
labelsize = 20,
pointsize = 4,
main = "K-means clustering of RNA-Seq samples by readcounts AFTER Principal Component Analysis"
) +
# Set some ggplot theme information
theme(text = element_text(size=20)) +
# Set the colour and fill scheme to viridis
scale_fill_viridis_d() +
scale_colour_viridis_d()
# Look at our plot
readcounts_PCA_kmeans.plot
| How do we identify trends or groups within deeply complex data in an unsupervised manner? |
So we've just spent most of our time learning how to transform our datasets based on the large amounts of variation in the data. We've discussed some limitations, but in some cases, depending on your dataset, you may be more interested in finding small, local similarities than in the large-scale variation that PCA captures.
There are two popular projections we'll discuss today, but first let's put together a more complex dataset based on the RNAseq data in `GSE150316_DeseqNormCounts_final.txt` and its companion file `2020.07.30.20165241-supp_tables.xlsx`.
# Read in our RNAseq data
tissue_data.df <- ...(file = ...,
header = TRUE,
row.names = 1)
# Take a quick look at it
head(tissue_data.df)
dim(tissue_data.df)
# Read in some additional patient data
patient_data.df <- read_excel("./data/2020.07.30.20165241-supp_tables.xlsx", sheet=2)
# Take a quick look at it
head(patient_data.df)
dim(patient_data.df)
First let's examine our `patient_data.df` (24 patient samples listed in total) and reformat the column names a bit before selecting just the viral load and viral load percentage information in the 2nd and 3rd columns. We'll hold onto this in `patient_viral_load.df` for later use.
# Create a dataframe holding just the viral_load information for each patient
patient_viral_load.df <-
patient_data.df %>%
# Retain just the first 3 columns
select(1:3) %>%
# Rename the 2nd and 3rd columns
...(case_num = 1, viral_load = 2, viral_load_percent = 3)
head(patient_viral_load.df)
We now want to format `tissue_data.df` much as we did with our PHU data. Our current data lists genes as observations and tissue samples as columns. Essentially, we'd like to transpose it. We could do that directly, but transposition converts everything to a matrix, and in the end we want a data frame so we can hold more information.
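To see why a plain transpose won't do, here is a tiny illustration: `t()` on a data frame coerces it to a matrix, and a single character column would turn every value into text.

```r
# t() silently coerces a data frame to a matrix
df <- data.frame(sample = c("s1", "s2"), geneA = c(1, 2), geneB = c(3, 4))
tm <- t(df)

class(tm)      # a matrix, no longer a data frame
tm["geneA", ]  # the numeric values are now character strings
```

This is why we'll take the `pivot_longer()`/`pivot_wider()` route instead: it keeps the result a data frame and preserves column types.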
First, however, to reduce memory use and runtime, we should trim our dataset. We aren't really interested in looking at all 59090 genes, many of which may be barely expressed. Since these are normalized counts across the dataset, we can filter out low-expression genes to make `tissue_data_filtered.df`. Yes, this would again be considered a form of feature elimination. In general we will:
# Trim the tissue data down
tissue_data_filtered.df <-
tissue_data.df %>%
# Convert the row names to a column
rownames_to_column(var="gene") %>%
# Set up the table to perform row-wise operations
rowwise() %>%
# Calculate the mean expression of each gene across all tissue samples
mutate(mean = ...(c_across(where(is.numeric)))) %>%
# Filter for samples with low expression
filter(mean > 0.5) %>%
# Calculate overall variance in case we need to make our dataset smaller
mutate(variance = ...(c_across(where(is.numeric)))) %>%
# Arrange samples by descending variance
arrange(desc(variance)) %>%
# Remove the grouping specification
ungroup()
# Take a look at the final results
head(tissue_data_filtered.df)
# how big is our filtered data frame?
dim(tissue_data_filtered.df)
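The same mean-filter and variance-rank idea can be sketched in base R on a toy gene-by-sample table; the dplyr pipeline above is doing this row-wise with `c_across()`:

```r
# Filter low-expression genes, then rank survivors by variance
set.seed(42)
counts <- matrix(rexp(50, rate = 2), nrow = 10,
                 dimnames = list(paste0("gene", 1:10), paste0("sample", 1:5)))

kept <- counts[rowMeans(counts) > 0.5, , drop = FALSE]  # drop low expression
kept <- kept[order(apply(kept, 1, var), decreasing = TRUE), , drop = FALSE]

dim(kept)  # fewer genes, ordered by descending variance
```

Keeping the variance column around, as the pipeline does, lets us trim further later by simply taking the top-n most variable genes.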
Now that we've filtered our data down to ~29k genes, we'll run through two more steps: `pivot_longer()` and `pivot_wider()`.
# You need to transpose the data.
# We can do it with dplyr to keep it as a data frame and to add some info
tissue_RNAseq.df <-
tissue_data_filtered.df %>%
# trim down the columns
select(1:89) %>%
# pivot longer
pivot_longer(cols=c(2:89), names_to = "sample", values_to = ...) %>%
# redistribute the gene names to columns
pivot_wider(names_from = gene, values_from = ...)
# We want to add some additional sample information before assessing the data
tissue_RNAseq.df <-
tissue_RNAseq.df %>%
# Grab just the sample names
select(sample) %>%
# Grab information from it like case number, tissue, and tissue number
# takes the form of "caseX.tissueY" or "caseX.tissue.NYC" or "NegControlX"
str_match_all(., pattern=r"(case([\w]+)\.([a-z]+)([\d|\.NYC]*)|(NegControl\d))") %>%
# Bind it all together
do.call(rbind, .) %>%
# Convert the results to a data frame
as.data.frame() %>%
# Rename the columns based on the capture groups
rename(., sample = V1, case_num = V2, tissue = V3, tissue_num = V4, neg_num = V5) %>%
# Coalesce some of the info due to negative control samples and clean up a column
mutate(case_num = coalesce(case_num, neg_num),
tissue_num = str_replace_all(.$tissue_num, pattern = "\\.", replacement = "")) %>%
# Drop the neg_num column
select(1:4) %>%
# Join this result to the RNAseq info
full_join(., y=tissue_RNAseq.df, by=c("sample" = "sample")) %>%
# Join that result to grab viral load information
right_join(patient_viral_load.df, y=., by=c("case_num" = "case_num"))
#Look at the resulting dataframe
head(tissue_RNAseq.df)
# How many tissue types do we have?
table(tissue_RNAseq.df$tissue)
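The capture-group idea used above with `str_match_all()` can also be sketched with base R's `regexec()`/`regmatches()`, using a simplified version of the sample-name pattern (the real pattern above also handles `.NYC` suffixes and negative controls):

```r
# Pull the case number and tissue out of a sample name with capture groups
samples <- c("case1.lung1", "case4.liver", "NegControl1")
hits <- regmatches(samples, regexec("case([0-9]+)\\.([a-z]+)", samples))

hits[[1]]  # full match first, then each capture group in order
hits[[3]]  # no match for the negative control: an empty character vector
```

`str_match_all()` returns the same information as a matrix with one column per capture group, which is why we can `rbind` and rename the columns afterwards.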
# Generate a matrix version of our data but drop the sample information!
tissue_RNAseq.mx <- as.matrix(tissue_RNAseq.df[,...])
head(tissue_RNAseq.mx)
dim(tissue_RNAseq.mx)
`Rtsne` package

We now have a somewhat more complex dataset. We are still short on actual samples (now observations), but 90 observations and nearly 30K features isn't so bad. A broad question we may wish to ask of such a dataset is whether there is an underlying structure to these samples, such as grouping based on tissue type or perhaps even sample preparation.
t-Distributed Stochastic Neighbour Embedding, or t-SNE, is a way to project our high-dimensional data onto a lower dimension with the aim of preserving local similarities rather than global disparity. When looking at data points, t-SNE attempts to preserve each point's local neighbouring structure. As the algorithm pores over the data, it can apply different transformations in different regions as it maps everything down to a lower dimension. It is considered "incredibly flexible" at finding local structure where other algorithms may not.
This flexibility is accomplished through 2 steps:

1. Build a probability distribution over pairs of points in the original high-dimensional space, such that similar points receive a high probability of being neighbours.
2. Define a similar distribution over the points in the low-dimensional map, then iteratively shift the mapped points to minimize the divergence between the two distributions.

We'll discuss more about how this algorithm affects interpretation after seeing the results, but this is considered an exploratory data visualization tool rather than an explanatory one.
To produce our t-SNE projection we'll use the Rtsne() function from the package of the same name. Some important parameters are:
- `X` is our data matrix where each row is an observation.
- `dims` sets the number of dimensions we'd like to project onto (default is 2).
- `initial_dims` sets the number of dimensions that should be retained in the initial PCA step (default is 50).
- `perplexity` is a numeric parameter that tunes between local and global aspects of your data.
- `pca_scale` is a boolean that sets whether the initial PCA step should use scaled data.
- `max_iter` is the maximum number of iterations in the algorithm (default is 1000).

# Rtsne prefers using a matrix for memory issues
# set a seed for reproducible results
set.seed(1981)
# A perplexity of 30 crashes: Rtsne requires 3 * perplexity < n - 1, so 29 is the ceiling here
# We have just 90 samples, but between 1-52 samples per "group"
tissue_tsne <- Rtsne(...,
dims=2,
perplexity=5,
verbose=TRUE,
pca_scale = TRUE,
max_iter = 1500)
str(...)
Looking above at the result of our t-SNE analysis, we can notice a few things...
- `Y` is an 88x2 matrix holding the coordinates for our new projection.

That's right, there is no underlying information for mapping back to our original dataset. It's a complete black box with no way to reverse engineer the process. That's because the process itself is stochastic! Whereas PCA is a deterministic process, repeatable the same way every time with the same data, that is not the case for t-SNE. So even though t-SNE can be quite powerful at identifying local similarities, it does not provide a mathematical pathway back to our original data!
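The deterministic/stochastic contrast can be verified in a couple of lines of base R: PCA gives the identical answer on every run, which is exactly what t-SNE does not guarantee unless you fix the seed first.

```r
# PCA is deterministic: same data in, same rotation out, every time
m <- matrix(rnorm(200), ncol = 4)

run1 <- prcomp(m)
run2 <- prcomp(m)

identical(run1$rotation, run2$rotation)  # TRUE
```

This is also why we call `set.seed()` before `Rtsne()`: without it, two runs on the same matrix would produce different embeddings.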
Let's extract the information, combine it with our sample information and project it using ggplot2
# Build our new data frame from the Y values
tissue_tsne.df <- data.frame(x.coord = ...,
y.coord = ...)
# Add in our sample information
tissue_tsne.df <- cbind(tissue_RNAseq.df[,1:6], tissue_tsne.df)
# Fix up the information just a little bit to remove NA viral load information
tissue_tsne.df <-
tissue_tsne.df %>%
# replace NAs with DNW (did not work)
mutate(viral_load = replace_na(viral_load, replace = "DNW"))
head(tissue_tsne.df)
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=20)
combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(12, "Set3"), brewer.pal(8, "Set1"))
# 1. Data
ggplot(data = tissue_tsne.df) +
# 2. Aesthetics
aes(x = x.coord,
y = y.coord,
colour = ...) +
# Themes
theme_bw() +
theme(text = element_text(size=20)) +
# 3. Scaling
scale_colour_manual(values = combo.colours) +
# 4. Geoms
geom_text(aes(label = case_num), size = 10)
While we don't have a lot of samples, you can still see that we were able to cluster some of our data by cell types without providing that classification to the algorithm! Great job team!
We can see that we get close clustering of tissues like placenta, heart, and bowel. Our liver samples are kind of everywhere but perhaps using a different perplexity would provide different results.
One interesting thing we can see is that regardless of tissue type, we see some samples are clustering together based on case number - namely case numbers 1, 4 and 5 seem to have some strong underlying expression profiles that connect them across tissue samples. We may also be seeing false relationships so beware!
`umap` package

UMAP (Uniform Manifold Approximation and Projection) is in the same flavour as the t-SNE projection but has some differences, including a faster runtime and better preservation of the data's global structure.
What does that mean for us? Faster results, and more interpretable results! Otherwise the same caveats apply. The setup is slightly easier, with few options to change if you keep the defaults. We can access `umap()` from the package of the same name. You may also alter the defaults by creating a `umap.defaults` object. More information on that here
For more tinkering, you can choose to use the uwot package instead where the umap() function has more options that are easily modified.
# Set our seed
set.seed(1981)
# Generate our projection
tissue_umap <- ...(tissue_RNAseq.mx)
str(tissue_umap)
Looking at our UMAP object tissue_umap we see it houses the projection parameters used but also some additional variables:
- `data`: holds our original data matrix.
- `layout`: contains the projection coordinates we need for plotting the data.
- `knn`: a weighted k-nearest neighbour graph, connecting each observation to its k nearest neighbours. This generates the first topological representation of the data, like an initial sketch.

You may notice again that there is no data suggesting how we arrived at this solution. There are no eigenvectors or eigenvalues to reverse the projection!
Let's extract the layout information, combine it with our sample information and project it using ggplot2
# Re-map our projection points with our tissue data
tissue_umap.df <- data.frame(x.coord = tissue_umap$...[,1],
y.coord = tissue_umap$...[,2])
tissue_umap.df <- cbind(tissue_RNAseq.df[,1:6], tissue_umap.df)
tissue_umap.df <-
tissue_umap.df %>%
# replace NAs with DNW (did not work)
mutate(viral_load = replace_na(viral_load, replace = "DNW"))
head(tissue_umap.df)
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=20)
combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(12, "Set3"), brewer.pal(8, "Set1"))
# 1. Data
ggplot(data = ...) +
# 2. Aesthetics
aes(x = x.coord,
y = y.coord,
colour = tissue) +
# Themes
theme_bw() +
theme(text = element_text(size=20)) +
# 3. Scaling
scale_colour_manual(values = combo.colours) +
# 4. Geoms
geom_text(aes(label = case_num), size = 10)
So it looks like, without much tinkering, we retrieved a fairly nice result. Liver and kidney are now more closely grouped as tissues, while heart samples still generally cluster together. Bowel and jejunum appear spatially grouped, and our placenta samples remain close to each other. The case-based clustering we saw with samples 1, 4, and 5 appears less severe.
There does appear to be some structure between the lung samples in different case numbers so this might be an avenue to explore next to try and see if there truly is a relationship between these groups.
| While t-SNE and UMAP projections do produce clustered data, you have no route back to understanding those relationships. PCA, on the other hand, is strictly a dimension-reduction tool. It does not place or assign datapoints to any groups, BUT it is useful on large datasets prior to clustering! |
Today we took a deep dive into principal component analysis. There are of course different variants of this based on the assumptions you can make about your observations and variables like independent component analysis (ICA, non-Gaussian features) and multiple correspondence analysis (MCA, categorical features). Some additional methods can also be used to store the transformation like a PCA does, notably variational autoencoders (VAE).
Overall we should remember that while PCA can have problems in its feature extraction, it is deterministic and repeatable. The final results are also provided in such a way that new observations can be transformed and projected onto the same principal components. You can also feed these components back into clustering algorithms like k-means to try to identify specific subgroups.
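That "new observations" property is easy to sketch with base R: `predict()` on a `prcomp` object projects data the model has never seen onto the components learned from the original data. (The variable names `train`, `new_obs`, and `proj` are just illustrative.)

```r
# Project NEW observations onto principal components learned earlier
train   <- matrix(rnorm(80), ncol = 4,
                  dimnames = list(NULL, paste0("g", 1:4)))
pca     <- prcomp(train, scale. = TRUE)

new_obs <- matrix(rnorm(8), ncol = 4,
                  dimnames = list(NULL, paste0("g", 1:4)))

# New points expressed in the existing PC coordinate system
proj <- predict(pca, newdata = new_obs)
proj[, 1:2]
```

t-SNE and UMAP offer no direct equivalent of this, which is part of why their results are harder to carry forward to new data.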
t-SNE and UMAP, on the other hand, appear to do a much better job with high-dimensional data. They preserve local structure, and UMAP can also do a fairly good job of preserving global structure. These tools make for great exploratory analysis of your complex datasets. Interpretation of relationships, however, is not mathematically clear as it is in PCA. These are, after all, projections from a higher dimension for our simpler primate brains!
This week's assignment will be found under the current lecture folder under the "assignment" subfolder. It will include a Jupyter notebook that you will use to produce the code and answers for this week's assignment. Please provide answers in markdown or code cells that immediately follow each question section.
| Assignment breakdown | ||
|---|---|---|
| Code | 50% | - Does it follow best practices? |
| - Does it make good use of available packages? | ||
| - Was data prepared properly? | | |
| Answers and Output | 50% | - Is output based on the correct dataset? |
| - Are groupings appropriate? | | |
| - Are titles/axes/legends correct? | | |
| - Is interpretation of the graphs correct? |
Since coding styles and solutions can differ, students are encouraged to use best practices. Assignments may be rewarded for well-coded or elegant solutions.
You can save and download the Jupyter notebook in its native format. Submit this file to the appropriate assignment section by 1:59 pm on the date of our next class: April 14th, 2022.
Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
More information on calculating optimal clusters: https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/
Step-by-step how PCA works: https://builtin.com/data-science/step-step-explanation-principal-component-analysis
More PCA explanations here: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
The math of PCA: https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf
t-SNE in R and Python: https://datavizpyr.com/how-to-make-tsne-plot-in-r/
All about UMAP: https://umap-learn.readthedocs.io/en/latest/basic_usage.html